{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "197531c2",
   "metadata": {},
   "source": [
    "# Notebook 3: Evaluation with Ragas\n",
    "\n",
    "\n",
    "Leveraging a strong LLM for reference-free evaluation is an upcoming solution that has shown a lot of promise. They correlate better with human judgment than traditional metrics and also require less human annotation. Papers like G-Eval have experimented with this and given promising results but there are certain shortcomings too.\n",
    "\n",
    "LLM prefers their own outputs and when asked to compare between different outputs the relative position of those outputs matters more. LLMs can also have a bias toward a value when asked to score given a range and they also prefer longer responses.\n",
    "\n",
    "[Ragas](https://docs.ragas.io/en/latest/) aims to work around these limitations of using LLMs to evaluate your QA pipelines while also providing actionable metrics using as little annotated data as possible, cheaper, and faster.\n",
    "\n",
    "In this notebook, we will use NVIDIA AI playground's  Llama 70B LLM as a judge and eval model. **NVIDIA AI Playground** on NGC allows developers to experience state of the art LLMs accelerated on NVIDIA DGX Cloud with NVIDIA TensorRT nd Triton Inference Server. Developers get **free credits for 10K requests** to any of the available models. Sign up process is easy. Follow the instructions [here.](../docs/rag/aiplayground.md)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f61f7115",
   "metadata": {},
   "source": [
    "### Step 1: Set NVIDIA AI Playground API key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3028e418",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip install ragas --upgrade"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c17e93c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"\"\n",
    "import sys\n",
    "from continuous_eval.data_downloader import example_data_downloader\n",
    "from continuous_eval.evaluators import RetrievalEvaluator, GenerationEvaluator\n",
    "from datasets import Dataset\n",
    "import json\n",
    "import pandas as pd\n",
    "from continuous_eval.metrics import PrecisionRecallF1, RankedRetrievalMetrics, DeterministicAnswerCorrectness, DeterministicFaithfulness, BertAnswerRelevance, BertAnswerSimilarity, DebertaAnswerScores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98b62314-9a33-44ae-aac6-fa9360f4f610",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings\n",
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "os.environ['NVIDIA_API_KEY'] = \"\"\n",
    "os.environ['NVAPI_KEY'] = \"\"\n",
    "# make sure to export your NVIDIA AI Playground key as NVIDIA_API_KEY!\n",
    "llm = ChatNVIDIA(model=\"playground_steerlm_llama_70b\")\n",
    "nv_embedder = NVIDIAEmbeddings(model=\"nvolveqa_40k\")\n",
    "nv_document_embedder = NVIDIAEmbeddings(model=\"nvolveqa_40k\", model_type=\"passage\")\n",
    "nv_query_embedder = NVIDIAEmbeddings(model=\"nvolveqa_40k\", model_type=\"query\")\n",
    "\n",
    "model_name = \"intfloat/e5-large-v2\"\n",
    "model_kwargs = {\"device\": \"cuda\"}\n",
    "encode_kwargs = {\"normalize_embeddings\": True}\n",
    "e5_embeddings = HuggingFaceEmbeddings(\n",
    "    model_name=model_name,\n",
    "    model_kwargs=model_kwargs,\n",
    "    encode_kwargs=encode_kwargs,\n",
    ")\n",
    "\n",
    "from langchain.embeddings import HuggingFaceBgeEmbeddings\n",
    "model_name = \"facebook/dragon-plus-context-encoder\"\n",
    "model_kwargs = {'device': 'cuda'}\n",
    "encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity\n",
    "dragon_embeddings = HuggingFaceBgeEmbeddings(\n",
    "    model_name=model_name,\n",
    "    model_kwargs=model_kwargs,\n",
    "    encode_kwargs=encode_kwargs,\n",
    "    query_instruction=\"Represent this sentence for searching relevant passages: \"\n",
    ")\n",
    "dragon_embeddings.query_instruction = \"Represent this sentence for searching relevant passages: \""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f41345c-a7fe-45f2-bc4d-9c86a25eb5a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(llm.available_models)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "550f6001",
   "metadata": {},
   "source": [
    "### Bring your own LLMs¶\n",
    "Ragas uses langchain under the hood for connecting to LLMs for metrices that require them. This means you can swap out the default LLM (gpt-3.5) with llama2 70B from AI playground."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04eb374d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragas.llms import LangchainLLM\n",
    "nvpl_llm = LangchainLLM(llm=llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a35a02c-7a8a-4eda-aa03-995f5364dcd1",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# for i in data_samples['retrieved_contexts']:\n",
    "#     print((i))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea3e88f8",
   "metadata": {},
   "source": [
    "### Step 2: Import Eval Data and Reformat It.\n",
    "#### Lets start with E5 evals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f32a57b-6590-4c3f-b728-5b1e7dbbd798",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "evals_file = '../synthetic_data_openai.json'\n",
    "with open(evals_file, 'r') as file:\n",
    "    json_data_syn_e5 = json.load(file)\n",
    "print(json_data_syn_e5[10])\n",
    "eval_questions = []\n",
    "eval_answers = []\n",
    "ground_truths = []\n",
    "vdb_contexts = []\n",
    "gt_contexts = []\n",
    "counter = 0\n",
    "for entry in json_data_syn_e5:\n",
    "    entry[\"contexts\"] = [str(r) for r in entry[\"contexts\"]]\n",
    "    entry[\"gt_answer\"] = str(entry[\"gt_answer\"])\n",
    "    eval_questions.append(entry[\"question\"])\n",
    "    eval_answers.append(entry[\"answer\"])\n",
    "    vdb_contexts.append(entry[\"contexts\"][0:3])\n",
    "    ground_truths.append([entry[\"gt_answer\"]])\n",
    "    gt_contexts.append([entry[\"gt_context\"]])\n",
    "\n",
    "data_samples = {\n",
    "    'question': eval_questions,\n",
    "    'answer': eval_answers,\n",
    "    'retrieved_contexts' : vdb_contexts,\n",
    "    'ground_truths': ground_truths,\n",
    "    'ground_truth_contexts':gt_contexts\n",
    "}\n",
    "\n",
    "dataset_syn_E5 = pd.DataFrame.from_dict(data_samples) # Dataset.from_dict(data_samples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9608a925-1a73-43bd-a207-adf5f709c4b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use continous evals library to generate retriever/generator metrics\n",
    "evaluator = RetrievalEvaluator(\n",
    "    dataset=dataset_syn_E5,\n",
    "    metrics=[\n",
    "        PrecisionRecallF1(),\n",
    "        RankedRetrievalMetrics(),\n",
    "    ],\n",
    ")\n",
    "# Run the eval!\n",
    "r_results = evaluator.run(k=3,batch_size=50)\n",
    "# Peaking at the results\n",
    "print(evaluator.aggregated_results)\n",
    "# Saving the results for future use\n",
    "# evaluator.save(\"retrieval_evaluator_results.jsonl\")\n",
    "\n",
    "evaluator = GenerationEvaluator(\n",
    "    dataset=dataset_syn_E5,\n",
    "    metrics=[\n",
    "        DeterministicAnswerCorrectness(),\n",
    "        DeterministicFaithfulness(),\n",
    "        # DebertaAnswerScores()\n",
    "    ],\n",
    ")\n",
    "# Run the eval!\n",
    "g_results = evaluator.run(batch_size=50)\n",
    "# Peaking at the results\n",
    "print(evaluator.aggregated_results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70f46aa3-00b0-4476-8ff6-b3de6a49caf8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert results to dataframe\n",
    "g_results[0]\n",
    "df_g_results = pd.DataFrame(g_results)\n",
    "df_g_results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec01be7a",
   "metadata": {},
   "source": [
    "### Step 3: View and Interpret Results with RAGAS\n",
    "\n",
    "A Ragas score is comprised of the following:\n",
    "![ragas](imgs/ragas.png)\n",
    "\n",
    "#### Metrics explained \n",
    "1. **Faithfulness**: measures the factual accuracy of the generated answer with the context provided. This is done in 2 steps. First, given a question and generated answer, Ragas uses an LLM to figure out the statements that the generated answer makes. This gives a list of statements whose validity we have we have to check. In step 2, given the list of statements and the context returned, Ragas uses an LLM to check if the statements provided are supported by the context. The number of correct statements is summed up and divided by the total number of statements in the generated answer to obtain the score for a given example.\n",
    "   \n",
    "2. **Answer Relevancy**: measures how relevant and to the point the answer is to the question. For a given generated answer Ragas uses an LLM to find out the probable questions that the generated answer would be an answer to and computes similarity to the actual question asked.\n",
    "   \n",
    "3. **Context Relevancy**: measures the signal-to-noise ratio in the retrieved contexts. Given a question, Ragas calls LLM to figure out sentences from the retrieved context that are needed to answer the question. A ratio between the sentences required and the total sentences in the context gives you the score\n",
    "\n",
    "4. **Context Recall**: measures the ability of the retriever to retrieve all the necessary information needed to answer the question. Ragas calculates this by using the provided ground_truth answer and using an LLM to check if each statement from it can be found in the retrieved context. If it is not found that means the retriever was not able to retrieve the information needed to support that statement."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4428051",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "with open(evals_file, 'r') as file:\n",
    "    json_data = json.load(file)\n",
    "eval_questions = []\n",
    "eval_answers = []\n",
    "ground_truths = []\n",
    "vdb_contexts = []\n",
    "# gt_contexts = []\n",
    "counter = 0\n",
    "for entry in json_data:\n",
    "    # print(type([entry[\"gt_context\"]]))\n",
    "    entry[\"contexts\"] = [str(r) for r in entry[\"contexts\"]]\n",
    "    entry[\"gt_answer\"] = str(entry[\"gt_answer\"])\n",
    "    eval_questions.append(entry[\"question\"])\n",
    "    eval_answers.append(entry[\"answer\"])\n",
    "    vdb_contexts.append(entry[\"contexts\"][0:3])\n",
    "    ground_truths.append([entry[\"gt_answer\"]])\n",
    "    # gt_contexts.append([entry[\"gt_context\"]])\n",
    "\n",
    "data_samples = {\n",
    "    'question': eval_questions,\n",
    "    'answer': eval_answers,\n",
    "    'contexts' : vdb_contexts,\n",
    "    'ground_truths': ground_truths,\n",
    "    # 'ground_truth_contexts':gt_contexts\n",
    "}\n",
    "\n",
    "dataset_syn_e5 = Dataset.from_dict(data_samples)\n",
    "dataset_syn_e5[57]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6300248f-b0a6-4665-9443-923907cc6321",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragas.metrics import (\n",
    "    answer_relevancy,\n",
    "    faithfulness,\n",
    "    context_recall,\n",
    "    context_precision,\n",
    "    answer_correctness,\n",
    "    ContextRecall,\n",
    ")\n",
    "from ragas.metrics import Faithfulness\n",
    "from ragas.metrics import AnswerCorrectness\n",
    "from ragas.metrics import ContextRelevancy\n",
    "from ragas.metrics import AnswerRelevancy\n",
    "\n",
    "context_relevancy = ContextRelevancy(llm=nvpl_llm)\n",
    "\n",
    "faithfulness = Faithfulness(\n",
    "    batch_size = 15\n",
    ")\n",
    "faithfulness.llm = nvpl_llm\n",
    "\n",
    "answer_correctness = AnswerCorrectness(llm=nvpl_llm,\n",
    "    weights=[0.4,0.6]\n",
    ")\n",
    "answer_correctness.llm = nvpl_llm\n",
    "\n",
    "context_recall = ContextRecall(llm=nvpl_llm)\n",
    "context_recall.llm = nvpl_llm\n",
    "\n",
    "context_precision.llm = nvpl_llm\n",
    "\n",
    "\n",
    "# answer_correctness.embeddings = nv_query_embedder\n",
    "## using NVIDIA embedding\n",
    "from ragas.metrics import AnswerSimilarity\n",
    "\n",
    "answer_similarity = AnswerSimilarity(llm=nvpl_llm, embeddings=nv_query_embedder)\n",
    "answer_relevancy = AnswerRelevancy(embeddings=nv_query_embedder,llm=nvpl_llm) #embeddings=nv_query_embedder,\n",
    "# init_model to load models used\n",
    "answer_relevancy.init_model()\n",
    "answer_correctness.init_model()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5ab58e2-d12d-477a-9bc0-91daffb13a12",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from ragas import evaluate\n",
    "# openai.api_key = os.getenv(INSERT YOUR OPENAI_API_KEY)\n",
    "results_1 = evaluate(dataset_syn_e5,metrics=[faithfulness,answer_similarity,context_precision,answer_relevancy,context_relevancy])\n",
    "# results_2 = evaluate(dataset_syn_e5,metrics=[answer_relevancy])\n",
    "# results_3 = evaluate(dataset_syn_e5,metrics=[context_precision])\n",
    "# results_4 = evaluate(dataset_syn_e5,metrics=[context_relevancy])\n",
    "results_5 = evaluate(dataset_syn_e5,metrics=[context_recall])\n",
    "# results_6 = evaluate(dataset_syn_e5,metrics=[answer_correctness])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62841008-2c90-41f6-8e36-0f98fa3d4621",
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = results_5.to_pandas()\n",
    "df2\n",
    "results_5['context_recall']=df2['context_recall'].mean(skipna=True)\n",
    "print(results_5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49f169ac-8c25-4582-8113-1f942c9d6f94",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df = results_1.to_pandas()\n",
    "df\n",
    "# results\n",
    "df_merge = pd.concat([df, df2['context_recall'], df_g_results], axis = 1)\n",
    "results_1.update(evaluator.aggregated_results)\n",
    "results_1.update(results_5)\n",
    "del results_1['rouge_faithfulness']\n",
    "del results_1['token_overlap_faithfulness']\n",
    "del results_1['bleu_faithfulness']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2db28de-b2fa-4939-96fd-ee0489a96ea2",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df_merge['ground_truths'] = df_merge['ground_truths'].apply(', '.join)\n",
    "df_merge['contexts'] = df_merge['contexts'].apply(', '.join)\n",
    "df_merge = df_merge.drop(columns=['rouge_faithfulness', 'token_overlap_faithfulness','bleu_faithfulness','rouge_p_by_sentence','token_overlap_p_by_sentence','blue_score_by_sentence'])\n",
    "df_merge.to_pickle('metrics_df.pkl')\n",
    "df_merge"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae684dde-7af5-4124-b8d4-6698dfa46da3",
   "metadata": {},
   "source": [
    "#### Plot base evals with the above Metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ec2e4f4-7efb-4e5e-b546-ab647d2333a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# @title\n",
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams.update({'font.size': 14})\n",
    "plt.rcParams[\"backend\"]='agg'\n",
    "def plot_metrics_with_values(metrics_dict, title='RAG Metrics',figsize=(10, 6),name='save.png'):\n",
    "    \"\"\"\n",
    "    Plots a bar chart for metrics contained in a dictionary and annotates the values on the bars.\n",
    "    Args:\n",
    "    metrics_dict (dict): A dictionary with metric names as keys and values as metric scores.\n",
    "    title (str): The title of the plot.\n",
    "    \"\"\"\n",
    "    names = list(metrics_dict.keys())\n",
    "    values = list(metrics_dict.values())\n",
    "    plt.figure(figsize=figsize)\n",
    "    bars = plt.barh(names, values, color='green')\n",
    "    # Adding the values on top of the bars\n",
    "    for bar in bars:\n",
    "        width = bar.get_width()\n",
    "        plt.text(width - 0.2,  # x-position\n",
    "                 bar.get_y() + bar.get_height() / 2,  # y-position\n",
    "                 f'{width:.4f}',  # value\n",
    "                 va='center')\n",
    "    plt.xlabel('Score')\n",
    "    plt.title(title)\n",
    "    plt.xlim(0, 1)  # Setting the x-axis limit to be from 0 to 1\n",
    "    # plt.show()\n",
    "    plt.savefig(name, bbox_inches='tight', dpi=150)\n",
    "    # bars = plt.bar(names, values, color='skyblue')\n",
    "    # # Adding the values on top of the bars\n",
    "    # for bar in bars:\n",
    "    #     width = bar.get_height()\n",
    "    #     # plt.text(width + 0.01,  # x-position\n",
    "    #     #          bar.get_x() + bar.get_width() / 2,  # y-position\n",
    "    #     #          f'{width:.4f}',  # value\n",
    "    #     #          va='center')\n",
    "    # plt.ylabel('Score')\n",
    "    # plt.title(title)\n",
    "    # plt.xticks(rotation=30)\n",
    "    # plt.ylim(0, 1)  # Setting the x-axis limit to be from 0 to 1\n",
    "    # plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "07f48d91-90fe-4e7e-8b5b-fd4853f5ed9b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "456414b6-870c-4cd0-b6e0-51854f1afe30",
   "metadata": {},
   "outputs": [],
   "source": [
    "generator_metrics = ['faithfulness','answer_relevancy']\n",
    "retriever_metrics = ['context_precision','context_relevancy','context_recall']\n",
    "endtoend = ['answer_similarity','token_overlap_f1','token_overlap_precision','token_overlap_recall','rouge_l_f1','rouge_l_precision','rouge_l_recall','bleu_score']\n",
    "results_generator = {}\n",
    "results_retriever = {}\n",
    "results_endtoend = {}\n",
    "for i in generator_metrics:\n",
    "    results_generator[i] = results_1[i]\n",
    "for i in retriever_metrics:\n",
    "    results_retriever[i] = results_1[i]\n",
    "for i in endtoend:\n",
    "    results_endtoend[i] = results_1[i]\n",
    "print(results_generator)\n",
    "print(results_retriever)\n",
    "print(results_endtoend)\n",
    "plot_metrics_with_values(results_generator, \"Base Generator Metrics\",figsize=(6, 3),name='gen.jpg')\n",
    "plot_metrics_with_values(results_retriever, \"Base Retriever Metrics\",figsize=(6, 3),name='ret.jpg')\n",
    "plot_metrics_with_values(results_endtoend, \"Base End-to-End Metrics\",figsize=(10, 6),name='end.jpg')\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
